This practical is based on exploratory data analysis and prediction of a dataset derived from a municipal database of healthcare administrative data. This dataset comes from Vitória, the capital city of Espírito Santo, Brazil (metropolitan population ~1.8 million), and was freely shared under a Creative Commons license.

Generate an rmarkdown report that contains all the necessary code to document and perform: EDA, prediction of no-shows using XGBoost, and an analysis of variable/feature importance using this data set. Ensure your report includes answers to any questions marked in bold. Please submit your report via brightspace as a link to a git repository containing the rmarkdown and compiled/knitted html version of the notebook.

Introduction

The Brazilian public health system, known as SUS (Unified Health System, from its acronym in Portuguese), is one of the largest health systems in the world, representing government investment of more than 9% of GDP. However, its operation is not homogeneous and there are distinct perceptions of quality from citizens in different regions of the country. Non-attendance of medical appointments contributes a significant additional burden on limited medical resources. This analysis investigates possible factors behind non-attendance using an administrative database of appointment data from Vitória, Espírito Santo, Brazil.

The data required is available via the course website.

Understanding the data

1. Use the data dictionary to describe each of the variables/features in the CSV in your report. a) PatientID: Unique identifier for each patient b) AppointmentID: Unique identifier for each appointment c) Gender: Patient gender (limited to Male or Female) d) ScheduledDate: Date on which the appointment was scheduled e) AppointmentDate: Date of the actual appointment f) Age: Patient age g) Neighbourhood: District of Vitória in which the appointment took place h) SocialWelfare: Patient is a recipient of Bolsa Família welfare payments (Boolean) i) Hypertension: Patient previously diagnosed with hypertension (Boolean) j) Diabetes: Patient previously diagnosed with diabetes (Boolean) k) AlcoholUseDisorder: Patient previously diagnosed with alcohol use disorder (Boolean) l) Disability: Patient previously diagnosed with a disability (severity rated 0-4) m) SMSReceived: At least 1 reminder text sent before appointment (Boolean) n) NoShow: Patient did not attend scheduled appointment (Boolean: Yes/No)

2. Can you think of 3 hypotheses for why someone may be more likely to miss a medical appointment? a) Younger people will be more likely to miss an appointment. b) Those who have not received an SMS reminder text will be more likely to miss an appointment. c) Appointments scheduled well in advance of the actual appointment date will be more likely to have a no-show.

3. Can you provide 3 examples of important contextual information that is missing in this data dictionary and dataset that could impact your analyses? a) There is no indication of the type of appointment in the dataset. b) The dataset does not include any information about which neighbourhood the patient lives in (so as to make a comparison between their home and appointment locations). c) There is no data on which doctor saw the patient at each appointment. This information could help determine whether certain doctors have more missed appointments, assuming there is more than one doctor per clinic.

Data Parsing and Cleaning

4 Modify the following to make it reproducible, i.e., so it downloads the data file directly from version control

data_path <- 'data/2016_05v2_VitoriaAppointmentData.csv'
data_link <- 'https://raw.githubusercontent.com/maguire-lab/health_data_science_research/master/static_files/practicals/lab1_data/2016_05v2_VitoriaAppointmentData.csv'
raw.data <- read_csv(data_link, col_types='fffTTifllllflf')
#raw.data <- read_csv(data_path)

Now we need to check the data is valid: because we specified col_types and the data parsed without error, most of our data seems to at least be formatted as we expect, i.e., ages are integers.

raw.data %>% filter(Age > 110)
## # A tibble: 5 × 14
##   PatientID   AppointmentID Gender ScheduledDate       AppointmentDate       Age
##   <fct>       <fct>         <fct>  <dttm>              <dttm>              <int>
## 1 3196321161… 5700278       F      2016-05-16 09:17:44 2016-05-19 00:00:00   115
## 2 3196321161… 5700279       F      2016-05-16 09:17:44 2016-05-19 00:00:00   115
## 3 3196321161… 5562812       F      2016-04-08 14:29:17 2016-05-16 00:00:00   115
## 4 3196321161… 5744037       F      2016-05-30 09:44:51 2016-05-30 00:00:00   115
## 5 7482345792… 5717451       F      2016-05-19 07:57:56 2016-06-03 00:00:00   115
## # … with 8 more variables: Neighbourhood <fct>, SocialWelfare <lgl>,
## #   Hypertension <lgl>, Diabetes <lgl>, AlcoholUseDisorder <lgl>,
## #   Disability <fct>, SMSReceived <lgl>, NoShow <fct>

We can see there are two patients older than 110 (across five appointments), which seems suspicious, but we can't actually say this is impossible.

5 Are there any individuals with impossible ages? If so, we can drop these rows using filter, i.e., data <- data %>% filter(CRITERIA)

raw.data <- raw.data %>% filter(Age >= 0)

Yes, there are patients with ages less than 0. This is definitely impossible.
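Before filtering, it is worth counting how many rows are affected; a quick sketch, assuming the raw.data tibble loaded above:

```r
# Count rows with impossible (negative) ages before dropping them
raw.data %>% filter(Age < 0) %>% count()
```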

Exploratory Data Analysis

First, we should get an idea of whether the data meets our expectations. There are newborns in the data (Age == 0), and we wouldn't expect any of these to be diagnosed with diabetes, alcohol use disorder, or hypertension (although in theory it could be possible). We can easily check this:

raw.data %>% filter(Age == 0) %>% select(Hypertension, Diabetes, AlcoholUseDisorder) %>% unique()
## # A tibble: 1 × 3
##   Hypertension Diabetes AlcoholUseDisorder
##   <lgl>        <lgl>    <lgl>             
## 1 FALSE        FALSE    FALSE

We can also explore things like how many different neighbourhoods there are and how many appointments come from each:

count(raw.data, Neighbourhood, sort = TRUE)
## # A tibble: 81 × 2
##    Neighbourhood         n
##    <fct>             <int>
##  1 JARDIM CAMBURI     7717
##  2 MARIA ORTIZ        5805
##  3 RESISTÊNCIA        4431
##  4 JARDIM DA PENHA    3877
##  5 ITARARÉ            3514
##  6 CENTRO             3334
##  7 TABUAZEIRO         3132
##  8 SANTA MARTHA       3131
##  9 JESUS DE NAZARETH  2853
## 10 BONFIM             2773
## # … with 71 more rows

6 What is the maximum number of appointments from the same patient?

count(raw.data, PatientID, sort = TRUE)
## # A tibble: 62,298 × 2
##    PatientID           n
##    <fct>           <int>
##  1 822145925426128    88
##  2 99637671331        84
##  3 26886125921145     70
##  4 33534783483176     65
##  5 258424392677       62
##  6 871374938638855    62
##  7 6264198675331      62
##  8 75797461494159     62
##  9 66844879846766     57
## 10 872278549442       55
## # … with 62,288 more rows

The maximum number of appointments from the same patient is 88.
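Rather than reading the maximum off the top row of the sorted table, it can be computed directly; a small sketch using the same raw.data:

```r
# Largest number of appointments attributed to a single PatientID
raw.data %>%
  count(PatientID) %>%
  summarise(max_appointments = max(n))
```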

Let’s explore the correlation between variables:

# let's define a plotting function
corplot = function(df){
  
  cor_matrix_raw <- round(cor(df),2)
  cor_matrix <- melt(cor_matrix_raw)
  
  
  #Get triangle of the correlation matrix
  #Lower Triangle
  get_lower_tri<-function(cor_matrix_raw){
    cor_matrix_raw[upper.tri(cor_matrix_raw)] <- NA
    return(cor_matrix_raw)
  }
  
  # Upper Triangle
  get_upper_tri <- function(cor_matrix_raw){
    cor_matrix_raw[lower.tri(cor_matrix_raw)]<- NA
    return(cor_matrix_raw)
  }
  
  upper_tri <- get_upper_tri(cor_matrix_raw)
  
  # Melt the correlation matrix
  cor_matrix <- melt(upper_tri, na.rm = TRUE)
  
  # Heatmap Plot
  cor_graph <- ggplot(data = cor_matrix, aes(Var2, Var1, fill = value))+
    geom_tile(color = "white")+
    scale_fill_gradient2(low = "darkorchid", high = "orangered", mid = "grey50", 
                         midpoint = 0, limit = c(-1,1), space = "Lab", 
                         name="Pearson\nCorrelation") +
    theme_minimal()+ 
    theme(axis.text.x = element_text(angle = 45, vjust = 1, 
                                     size = 8, hjust = 1))+
    coord_fixed()+ geom_text(aes(Var2, Var1, label = value), color = "black", size = 2) +
    theme(
      axis.title.x = element_blank(),
      axis.title.y = element_blank(),
      panel.grid.major = element_blank(),
      panel.border = element_blank(),
      panel.background = element_blank(),
      axis.ticks = element_blank())+
      ggtitle("Correlation Heatmap")+
      theme(plot.title = element_text(hjust = 0.5))
  
  cor_graph
}

numeric.data = mutate_all(raw.data, function(x) as.numeric(x))

# Plot Correlation Heatmap
corplot(numeric.data)

Correlation heatmaps are useful for identifying linear relationships between variables/features. In this case, we are particularly interested in relationships between NoShow and any specific variables.

7 Which parameters most strongly correlate with missing appointments (NoShow)?

A patient receiving a text message (SMSReceived) and the appointment ID correlate most strongly with the patient being a no-show.

8 Are there any other variables which strongly correlate with one another?

There are a few other strong correlations, such as (appointment ID and receiving a text message), (age and hypertension), (age and diabetes), and (hypertension and diabetes)

9 Do you see any issues with PatientID/AppointmentID being included in this plot?

The issue with including the IDs in this plot is that they are not meaningfully related to the patient. Since patient IDs are presumably generated based on the order in which a patient joins an institution, and appointment IDs based on the ordering of appointments, they are unlikely to reveal genuine correlations. The one exception could be something like text messages, as a clinic might not have implemented SMS messaging until after a number of these appointments had occurred.
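One way to avoid this issue is simply to drop the ID columns before computing correlations; a sketch reusing the corplot() helper and numeric.data defined above:

```r
# Recompute the heatmap without the (arbitrary) identifier columns
numeric.data %>%
  select(-PatientID, -AppointmentID) %>%
  corplot()
```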

Let’s look at some individual variables and their relationship with NoShow.

ggplot(raw.data) + 
  geom_density(aes(x=Age, fill=NoShow), alpha=0.8) + 
  ggtitle("Density of Age by Attendance")

There does seem to be a difference in the distribution of ages of people that miss and don’t miss appointments.
However, the shape of this distribution means the actual correlation is near 0 in the heatmap above. This highlights the need to look at individual variables.

Let’s take a closer look at age by breaking it into categories.

raw.data <- raw.data %>% mutate(Age.Range=cut_interval(Age, length=10))

ggplot(raw.data) + 
  geom_bar(aes(x=Age.Range, fill=NoShow)) + 
  ggtitle("Amount of No Show across Age Ranges")

ggplot(raw.data) + 
  geom_bar(aes(x=Age.Range, fill=NoShow), position='fill') + 
  ggtitle("Proportion of No Show across Age Ranges")

10 How could you be misled if you only plotted 1 of these 2 plots of attendance by age group?

From the first plot, we see that people older than 70 make up very few of the total appointments. When looking at only the second plot, one might assume that those groups are contributing a lot to the number of missed appointments, especially the (110, 120] cohort. However, when we look at both plots together, we can see that the actual number of missed appointments by those >= 70 years old is much smaller than the second plot makes it look.
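The two views can also be combined into a single summary table, which makes the small denominators in the oldest cohorts explicit; a sketch assuming the Age.Range column created above:

```r
# Appointment counts and no-show rates per age band
raw.data %>%
  group_by(Age.Range) %>%
  summarise(appointments = n(),
            no_show_rate = mean(NoShow == "Yes"))
```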

raw.data %>% filter(Age == 0) %>% count()
## # A tibble: 1 × 1
##       n
##   <int>
## 1  3539

Next, we’ll have a look at SMSReceived variable:

ggplot(raw.data) + 
  geom_bar(aes(x=SMSReceived, fill=NoShow), alpha=0.8) + 
  ggtitle("Amount of SMS Received by Attendance")

ggplot(raw.data) + 
  geom_bar(aes(x=SMSReceived, fill=NoShow), position='fill', alpha=0.8) + 
  ggtitle("Proportion of SMS Received by Attendance")

11 From this plot does it look like SMS reminders increase or decrease the chance of someone not attending an appointment? Why might the opposite actually be true (hint: think about biases)?

From the plot, it looks like SMS reminders increase the likelihood of someone not attending an appointment. However, given that far fewer older patients receive appointment reminders (and older patients are more likely to attend), it is likely that the SMS reminders are actually working as intended.
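This suspicion can be checked directly by summarising age and attendance within each SMS group; a sketch on the same raw.data:

```r
# Do SMS recipients differ systematically in age and attendance?
raw.data %>%
  group_by(SMSReceived) %>%
  summarise(n = n(),
            mean_age = mean(Age),
            no_show_rate = mean(NoShow == "Yes"))
```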

12 Create a similar plot which compares the the density of NoShow across the values of disability

ggplot(raw.data) + 
  geom_bar(aes(x=Disability, fill=NoShow), alpha=0.8) + 
  ggtitle("Density of No Show across Disability")

ggplot(raw.data) + 
  geom_bar(aes(x=Disability, fill=NoShow), position='fill', alpha=0.8) + 
  ggtitle("Proportion of No Show across Disability")

In these plots we see that the vast majority of patients have no disabilities, and also account for most of the no-shows. Comparing this to the proportions plot, the proportions suggest that the no-shows are distributed comparatively evenly.

Now let’s look at the neighbourhood data as location can correlate highly with many social determinants of health.

ggplot(raw.data) + 
  geom_bar(aes(x=Neighbourhood, fill=NoShow)) + 
  theme(axis.text.x = element_text(angle=45, hjust=1, size=5)) + 
  ggtitle('Attendance by Neighbourhood')

ggplot(raw.data) + 
  geom_bar(aes(x=Neighbourhood, fill=NoShow), position='fill') + 
  theme(axis.text.x = element_text(angle=45, hjust=1, size=5)) + 
  ggtitle('Proportional Attendance by Neighbourhood')

Most neighbourhoods have similar proportions of no-shows, but some have much higher or lower rates.
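To see which neighbourhoods sit at the extremes, the rates can be tabulated directly; a sketch:

```r
# No-show rate per neighbourhood, highest first
raw.data %>%
  group_by(Neighbourhood) %>%
  summarise(appointments = n(),
            no_show_rate = mean(NoShow == "Yes")) %>%
  arrange(desc(no_show_rate))
```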

13 Suggest a reason for differences in attendance rates across neighbourhoods.

The age demographics of each neighbourhood could differ substantially from one another, and age is related to attendance as shown above. Something like a population pyramid per neighbourhood would help make this comparison.

Now let’s explore the relationship between gender and NoShow.

ggplot(raw.data) + 
  geom_bar(aes(x=Gender, fill=NoShow)) + 
  ggtitle("Gender by Attendance")

ggplot(raw.data) + 
  geom_bar(aes(x=Gender, fill=NoShow), position='fill') + 
  ggtitle("Proportional Gender by Attendance")

14 Create a similar plot using SocialWelfare

ggplot(raw.data) + 
  geom_bar(aes(x=SocialWelfare, fill=NoShow)) + 
  ggtitle("Social Welfare Usage by Attendance")
ggplot(raw.data) +
  geom_bar(aes(x=SocialWelfare, fill=NoShow), position='fill') + 
  ggtitle("Proportional Social Welfare Usage by Attendance")

Far more exploration could still be done, including dimensionality-reduction approaches, but although we have found some patterns, there are no major/striking patterns in the data as it currently stands.

However, maybe we can generate some new features/variables that more strongly relate to the NoShow.

Feature Engineering

Let’s begin by seeing if appointments on any day of the week have more no-shows. Fortunately, the lubridate library makes this quite easy!

raw.data <- raw.data %>% mutate(AppointmentDay = wday(AppointmentDate, label=TRUE, abbr=TRUE), 
                                 ScheduledDay = wday(ScheduledDate,  label=TRUE, abbr=TRUE))

ggplot(raw.data) +
  geom_bar(aes(x=AppointmentDay, fill=NoShow)) +
  ggtitle("Amount of No Show across Appointment Day") 

ggplot(raw.data) +
  geom_bar(aes(x=AppointmentDay, fill=NoShow), position = 'fill') +
  ggtitle("Proportion of No Show across Appointment Day") 

Next, let’s create a variable called Lag, which is the difference between when an appointment was scheduled and the actual appointment date.

raw.data <- raw.data %>% mutate(Lag.days=difftime(AppointmentDate, ScheduledDate, units = "days"),
                                Lag.hours=difftime(AppointmentDate, ScheduledDate, units = "hours"))

ggplot(raw.data) + 
  geom_density(aes(x=as.numeric(Lag.days), fill=NoShow), alpha=0.7) +
  xlab("Lag (days)") +
  ggtitle("Density of Lag (days) by Attendance")

15 Have a look at the values in the Lag variable. Does anything seem odd?

There is a massive spike in appointments with next to no lag time between scheduling an appointment and the date of the appointment. It seems as though a very large portion of appointments are booked last-minute. This could be due to the clinics trying to fill in appointments that were cancelled at the last minute. Alternatively, the clinics could have a long wait line for walk-in appointments, which would explain so many appointments occurring with no lead time.
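The size of that spike can be quantified; a sketch assuming the Lag.days column created above:

```r
# Share of appointments booked on the day of, and within a week of, the visit
raw.data %>%
  summarise(same_day    = mean(as.numeric(Lag.days) < 1),
            within_week = mean(as.numeric(Lag.days) < 7))
```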

Predictive Modeling

Let’s see how well we can predict NoShow from the data.

We’ll start by preparing the data, followed by splitting it into training and testing sets, modelling, and finally evaluating our results. For now we will subsample, but please run on the full dataset for the final execution.

data.prep <- raw.data %>% select(-AppointmentID, -PatientID)

set.seed(42)
data.split <- initial_split(data.prep, prop = 0.7)
train  <- training(data.split)
test <- testing(data.split)
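Since no-shows are the minority class (~20%), it is worth confirming that the random split preserves the class balance in both sets; a sketch:

```r
# Compare the NoShow proportions in the training and test sets
bind_rows(train = count(train, NoShow),
          test  = count(test,  NoShow),
          .id = "split") %>%
  group_by(split) %>%
  mutate(prop = n / sum(n))
```

If the proportions drifted noticeably, initial_split() accepts a strata argument to enforce a stratified split.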

Let’s now set the cross validation parameters, and add classProbs so we can use AUC as a metric for xgboost.

fit.control <- trainControl(method="cv",number=3,
                           classProbs = TRUE, summaryFunction = twoClassSummary)

16 Based on the EDA, how well do you think this is going to work?

Based on the exploratory data analysis, I think the model will work fairly well. We’ve identified the variables in the dataset that most strongly correlate with no-shows, and found that the ‘lag’ between an appointment’s scheduling date and its actual date relates strongly to attendance. As a result, I expect the lag to have the largest impact for predicting no-shows.

Now we can train our XGBoost model

xgb.grid <- expand.grid(eta=c(0.05),
                       max_depth=c(4),colsample_bytree=1,
                       subsample=1, nrounds=500, gamma=0, min_child_weight=5)

xgb.model <- train(NoShow ~ .,data=train, method="xgbTree",metric="ROC",
                  tuneGrid=xgb.grid, trControl=fit.control)

xgb.pred <- predict(xgb.model, newdata=test)
xgb.probs <- predict(xgb.model, newdata=test, type="prob")
test <- test %>% mutate(NoShow.numerical = ifelse(NoShow=="Yes",1,0))
confusionMatrix(xgb.pred, test$NoShow, positive="Yes")
## Confusion Matrix and Statistics
## 
##           Reference
## Prediction    No   Yes
##        No  26423  6409
##        Yes   122   204
##                                           
##                Accuracy : 0.803           
##                  95% CI : (0.7987, 0.8073)
##     No Information Rate : 0.8006          
##     P-Value [Acc > NIR] : 0.1313          
##                                           
##                   Kappa : 0.0408          
##                                           
##  Mcnemar's Test P-Value : <2e-16          
##                                           
##             Sensitivity : 0.030848        
##             Specificity : 0.995404        
##          Pos Pred Value : 0.625767        
##          Neg Pred Value : 0.804794        
##              Prevalence : 0.199439        
##          Detection Rate : 0.006152        
##    Detection Prevalence : 0.009832        
##       Balanced Accuracy : 0.513126        
##                                           
##        'Positive' Class : Yes             
## 
paste("XGBoost Area under ROC Curve: ", round(auc(test$NoShow.numerical, xgb.probs[,2]),3), sep="")
## Setting levels: control = 0, case = 1
## Setting direction: controls < cases
## [1] "XGBoost Area under ROC Curve: 0.742"
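The full ROC curve behind that AUC can also be drawn; a sketch using the pROC package (already needed for auc() above):

```r
# Plot the ROC curve for the XGBoost predicted probabilities
xgb.roc <- roc(test$NoShow.numerical, xgb.probs[,2])
plot(xgb.roc, print.auc = TRUE, main = "XGBoost ROC Curve")
```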

This isn’t an unreasonable performance, but let’s look a bit more carefully at the correct and incorrect predictions.

xgb.probs$Actual = test$NoShow.numerical
xgb.probs$ActualClass = test$NoShow
xgb.probs$PredictedClass = xgb.pred
xgb.probs$Match = ifelse(xgb.probs$ActualClass == xgb.probs$PredictedClass,
                         "Correct","Incorrect")
# [4.8] Plot Accuracy
xgb.probs$Match = factor(xgb.probs$Match,levels=c("Incorrect","Correct"))
ggplot(xgb.probs,aes(x=Yes,y=Actual,color=Match))+
  geom_jitter(alpha=0.2,size=0.25)+
  scale_color_manual(values=c("grey40","orangered"))+
  ggtitle("Visualizing Model Performance", "(Dust Plot)")

Finally, let’s close it off with the variable importance of our model:

results = data.frame(Feature = rownames(varImp(xgb.model)$importance)[1:10],
                     Importance = varImp(xgb.model)$importance[1:10,])

results$Feature = factor(results$Feature,levels=results$Feature)


# [4.10] Plot Variable Importance
ggplot(results, aes(x=Feature, y=Importance,fill=Importance))+
  geom_bar(stat="identity")+
  scale_fill_gradient(low="grey20",high="orangered")+
  ggtitle("XGBoost Variable Importance")+
  theme(axis.text.x = element_text(angle = 45, hjust = 1))

17 Using the caret package fit and evaluate 1 other ML model on this data.

I will be using Caret to fit a random forest model to the data.

data.prep <- raw.data %>% select(-AppointmentID, -PatientID)

set.seed(42)
data.split <- initial_split(data.prep, prop = 0.7)
train  <- training(data.split)
test <- testing(data.split)
fit.control <- trainControl(method="cv",number=3,
                           classProbs = TRUE, summaryFunction = twoClassSummary)
rf.model <- train(NoShow ~ .,data=train, method="rf",
                  metric="ROC", trControl=fit.control)

rf.pred <- predict(rf.model, newdata=test)
rf.probs <- predict(rf.model, newdata=test, type="prob")
test <- test %>% mutate(NoShow.numerical = ifelse(NoShow=="Yes",1,0))
confusionMatrix(rf.pred, test$NoShow, positive="Yes")
## Confusion Matrix and Statistics
## 
##           Reference
## Prediction    No   Yes
##        No  25420  5401
##        Yes  1125  1212
##                                           
##                Accuracy : 0.8032          
##                  95% CI : (0.7989, 0.8075)
##     No Information Rate : 0.8006          
##     P-Value [Acc > NIR] : 0.1171          
##                                           
##                   Kappa : 0.1861          
##                                           
##  Mcnemar's Test P-Value : <2e-16          
##                                           
##             Sensitivity : 0.18328         
##             Specificity : 0.95762         
##          Pos Pred Value : 0.51861         
##          Neg Pred Value : 0.82476         
##              Prevalence : 0.19944         
##          Detection Rate : 0.03655         
##    Detection Prevalence : 0.07048         
##       Balanced Accuracy : 0.57045         
##                                           
##        'Positive' Class : Yes             
## 
paste("Random Forest Area under ROC Curve: ", round(auc(test$NoShow.numerical, rf.probs[,2]),3), sep="")
rf.probs$Actual = test$NoShow.numerical
rf.probs$ActualClass = test$NoShow
rf.probs$PredictedClass = rf.pred
rf.probs$Match = ifelse(rf.probs$ActualClass == rf.probs$PredictedClass,
                         "Correct","Incorrect")
# [4.8] Plot Accuracy
rf.probs$Match = factor(rf.probs$Match,levels=c("Incorrect","Correct"))
ggplot(rf.probs,aes(x=Yes,y=Actual,color=Match))+
  geom_jitter(alpha=0.2,size=0.25)+
  scale_color_manual(values=c("grey40","orangered"))+
  ggtitle("Visualizing Model Performance", "(Dust Plot)")

From the dust plot, the random forest’s predictions look noisier than XGBoost’s. That said, the confusion matrix shows the random forest actually catches more true no-shows (sensitivity 0.183 vs 0.031) with a higher kappa (0.186 vs 0.041), at the cost of more false positives. On balance, I would still favour the XGBoost model for overall discrimination.
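A side-by-side table of the headline metrics makes the trade-off explicit; a sketch pulling values from the two caret confusion-matrix objects:

```r
# Compare the two models on accuracy, kappa, and sensitivity
xgb.cm <- confusionMatrix(xgb.pred, test$NoShow, positive = "Yes")
rf.cm  <- confusionMatrix(rf.pred,  test$NoShow, positive = "Yes")
tibble(Model       = c("XGBoost", "Random Forest"),
       Accuracy    = c(xgb.cm$overall["Accuracy"],    rf.cm$overall["Accuracy"]),
       Kappa       = c(xgb.cm$overall["Kappa"],       rf.cm$overall["Kappa"]),
       Sensitivity = c(xgb.cm$byClass["Sensitivity"], rf.cm$byClass["Sensitivity"]))
```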

18 Based on everything, do you think we can trust analyses based on this dataset? Explain your reasoning.

While I do think there is some value to analyses of this dataset, there are major reasons to be cautious. As the XGBoost feature-importance graph shows, prediction is dominated by the many appointments that are booked and held on the same or next day; all of the other features are significantly less important. I would attribute this to there being comparatively little chance for something to come up and cause a patient to miss an appointment when it was booked on the day itself.

When considering the other features of the dataset, the proportions of shows/no-shows are fairly uniform across the board. While something like an SMS reminder could help reduce the number of no-shows, it would be unfair to evaluate its performance without knowing whether the SMS messaging service was in place for all appointments in the dataset.

In short, I would consider the attendance of an appointment mostly dependent on the time between appointment booking and appointment date.

Credits

This notebook was based on a combination of other notebooks e.g., 1, 2, 3